Discovering Patterns and Subfamilies in Biosequences

نویسندگان

  • Alvis Brazma
  • Inge Jonassen
  • Esko Ukkonen
  • Jaak Vilo
چکیده

We consider the problem of automatic discovery of patterns and the corresponding subfamilies in a set of biosequences. The sequences are unaligned and may contain noise of unknown level. The patterns are of the type used in PROSITE database. In our approach we discover patterns and the respective subfamilies simultaneously. We develop a theoretically substantiated significance measure for a set of such patterns and an algorithm approximating the best pattern set and the subfamilies. The approach is based on the minimum description length (MDL) principle. We report a computing experiment correctly finding subfamilies in the family of chromo domains and revealing new strong patterns.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Reports in Informatics Relation Patterns and Their Automatic Discovery in Biosequences Relation Patterns and Their Automatic Discovery in Biosequences

We have extended the pattern language used in PROSITE to enable it to describe dependencies between amino acid residues. We have developed a minimum description length principle based tness measure evaluating the signiicance of such patterns in relation to a set of sequences, and an algorithm automatically nding signiicant patterns in unaligned sequences. Computing experiments are reported show...

متن کامل

Knowledge Discovery in Biosequences Using Sort Regular Patterns

This paper considers knowledge discovery by sort regular patterns, which are strings over sort letters representing nite sets of basic letters. We devise a learning algorithm for the class based on the minimal multiple generalization technique, and evaluate the method by experiments on biosequences from GenBank database. The experiments show that relatively a simple sort pattern can represent a...

متن کامل

An Efficient and Fast Algorithm for Mining Frequent Patterns on Multiple Biosequences

Mining frequent patterns on biosequences is one of the important research fields in biological data mining. Traditional frequent pattern mining algorithms may generate large amount of short candidate patterns in the process of mining which cost more computational time and reduce the efficiency. In order to overcome such shortcoming of the traditional algorithms, we present an algorithm named MS...

متن کامل

Reports in Informatics Approaches to the Automatic Discovery of Patterns in Biosequences

Approaches to the automatic discovery of patterns in biosequences. Abstract This paper is a survey of approaches and algorithms used for the automatic discovery of patterns in biosequences. Patterns with the expressive power in the class of regular languages are considered, and a classiication of pattern languages in this class is developed, covering those patterns which are the most frequently...

متن کامل

Permutation Pattern Discovery in Biosequences

Functionally related genes often appear in each other's neighborhood on the genome; however, the order of the genes may not be the same. These groups or clusters of genes may have an ancient evolutionary origin or may signify some other critical phenomenon and may also aid in function prediction of genes. Such gene clusters also aid toward solving the problem of local alignment of genes. Simila...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Proceedings. International Conference on Intelligent Systems for Molecular Biology

دوره 4  شماره 

صفحات  -

تاریخ انتشار 1996